Machine learning with in silico analysis markedly improves survival prediction modeling in colon cancer patients

Abstract
Background: Predicting the survival of cancer patients provides prognostic information and therapeutic guidance. However, improved prediction models are needed for use in diagnosis and treatment.
Objective: This study aimed to identify genomic prognostic biomarkers related to colon cancer (CC) based on computational data and to develop survival prediction models.
Methods: We performed machine-learning (ML) analysis to screen pathogenic survival-related driver genes related to patient prognosis by integrating copy number variation and gene expression data. Moreover, in silico system analysis was performed to clinically assess data from the ML analysis, and we identified RABGAP1L, MYH9, and DRD4 as candidate genes. These three genes and tumor stages were used to generate survival prediction models. The genes were then validated by experimental and clinical analyses, and the theranostic application of the survival prediction models was assessed.
Results: RABGAP1L, MYH9, and DRD4 were identified as survival-related candidate genes by ML and in silico system analysis. The survival prediction model using the expression of the three genes showed higher predictive performance when applied to predict the prognosis of CC patients. A series of functional analyses revealed that knockdown of each of the three genes reduced the protumor activity of CC cells. In particular, validation with an independent cohort of CC patients confirmed that coexpression of MYH9 and DRD4 was associated with poorer clinical outcomes in terms of overall survival and disease-free survival.
Conclusions: Our survival prediction approach will contribute to providing prognostic information and developing therapeutic strategies for CC patients.


ML-based Survival Analysis
To determine how the copy number variations (CNVs) or expression of the candidate cancer driver genes affected the clinical prognosis of patients with colorectal cancer (CRC), Kaplan-Meier survival curves were plotted for overall survival (OS) and disease-free survival (DFS) in each of the amplification and deletion groups. The cutoffs for identifying copy number changes in each sample can differ depending on the data [1,2]. We selected the cutoffs of 1%, 3%, and 5% used in a previous study [1] to determine amplifications and deletions. For each candidate cancer driver gene, patients with extreme amplification (those with CNV segment values in the top 1%, 3%, and 5%) were labeled as the amp groups, while patients with extreme deletion (those with CNV segment values in the bottom [...]

[...] (PI). Then, the mixture was incubated at room temperature for 15 min in the dark. After incubation, 400 μL of binding buffer was added, and the samples were then analyzed by flow cytometry (BD Accuri™ C6, BD Biosciences).

Wound Healing Assay
Cells were seeded at 8 × 10⁴ cells/well in IBIDI medium (IBIDI, Lochhamer, Germany). After 24 h, the IBIDI medium was removed, and 1 mL of culture medium with 10 nM cycloheximide (Sigma-Aldrich) was added. The wound area was photographed with a microscope (Am Leitz-Park, Wetzlar, Germany).
The wound area was measured and recorded at 72 h and compared with the initial wound area at 0 h to determine the wound healing rate. The images were analyzed at each time point using ImagePro Premier 9 (Media Cybernetics).
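The text does not state the formula for the wound healing rate. A minimal sketch, assuming the common definition of the rate as the percentage of the initial (0 h) wound area that has closed by 72 h, is shown below; the function name and the example areas are hypothetical.

```python
def wound_healing_rate(area_0h: float, area_72h: float) -> float:
    """Percentage of the initial wound area closed by 72 h.

    Assumed definition (not given in the text):
        rate = (area at 0 h - area at 72 h) / area at 0 h * 100
    """
    if area_0h <= 0:
        raise ValueError("initial wound area must be positive")
    return (area_0h - area_72h) / area_0h * 100.0


# Hypothetical areas in arbitrary pixel units, for illustration only.
rate = wound_healing_rate(area_0h=1000.0, area_72h=250.0)
print(f"wound healing rate: {rate:.1f}%")
```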

Survival Analysis Using Patient Samples
Using data for the expression of each gene in patients, we divided patients into groups with high or low expression as previously described [3,4]. Patients with gene expression higher than the average of all patients were defined as the high group, and patients with expression lower than the average were defined as the low group. When two or three genes were combined, patients whose expression of every combined gene was higher than the respective average were defined as the high group, and all other patients were defined as the low group. After grouping, we analyzed the OS and DFS. Survival was calculated using the Kaplan-Meier method, and comparisons were performed using log-rank tests.
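The grouping rule and the Kaplan-Meier estimate described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names and toy values are hypothetical, and patients whose expression equals the average are assigned to the low group, which is an assumption because the text does not cover that case.

```python
from statistics import mean
from typing import Dict, List, Tuple


def split_high_low(expression: Dict[str, List[float]]) -> List[str]:
    """Label a patient 'high' only if every gene exceeds its cohort mean.

    `expression` maps a gene name to per-patient values (same patient order).
    Values equal to the mean fall into 'low' (assumption, see lead-in).
    """
    genes = list(expression)
    n_patients = len(expression[genes[0]])
    means = {g: mean(expression[g]) for g in genes}
    labels = []
    for i in range(n_patients):
        all_high = all(expression[g][i] > means[g] for g in genes)
        labels.append("high" if all_high else "low")
    return labels


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    events[i] is 1 if the event (death or recurrence) was observed at
    times[i], and 0 if the patient was censored at that time.
    """
    curve = []
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        observed = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - observed / at_risk
        curve.append((t, survival))
    return curve


# Toy data for illustration only (not from the study).
labels = split_high_low({"MYH9": [1.0, 5.0, 3.0, 7.0], "DRD4": [2.0, 6.0, 1.0, 8.0]})
curve = kaplan_meier(times=[5, 8, 12, 12, 20], events=[1, 0, 1, 1, 0])
```

In practice, one curve would be estimated per group and the curves compared with a log-rank test, as stated in the text.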
We retrieved the complete datasets of CRC patients from Oncomine, obtaining expression data in CRC and normal colon tissues from 14 cohorts with transcriptome data that included our candidate genes (cohort names: Alon [5], Gaedcke [6], Gaspar [7], Graudens [8] [14], Skrzypczak2 [14], and Zou [15] studies and the TCGA [16] and TCGA2 [16] datasets). Based on the expression data, we calculated the fold change in CRC tissue compared to normal colon tissue and performed a meta-analysis to combine the data from the diverse datasets using METAL software [17]. As a result, the genes were divided into those whose expression in CRC tissues increased or decreased significantly compared to normal tissue (z-score > 2, p-value < 0.05, Supplementary Table S8). Additionally, using R2, we analyzed differences in gene expression in each stage and differences in recurrent CRC compared with nonrecurrent CRC using 5 cohorts (GSE75316, GSE37892, xin130617, GSE24551, and GSE18088).
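To illustrate how per-cohort results can be combined into a single z-score and p-value, the sketch below implements a sample-size-weighted z-score combination, one of the schemes that METAL provides. The text does not state which METAL scheme or weights were used, so this choice, the function names, and the toy inputs are assumptions.

```python
import math
from typing import List, Tuple


def log2_fold_change(tumor_mean: float, normal_mean: float) -> float:
    """log2 ratio of mean expression in tumor versus normal tissue."""
    return math.log2(tumor_mean / normal_mean)


def weighted_z(studies: List[Tuple[float, int]]) -> Tuple[float, float]:
    """Combine per-cohort (z-score, sample size) pairs.

    Each cohort is weighted by the square root of its sample size:
        Z = sum(w_i * z_i) / sqrt(sum(w_i ** 2)),  with w_i = sqrt(N_i)
    Returns the combined z-score and its two-sided normal p-value.
    """
    weights = [math.sqrt(n) for _, n in studies]
    numerator = sum(w * z for w, (z, _) in zip(weights, studies))
    denominator = math.sqrt(sum(w * w for w in weights))
    z_combined = numerator / denominator
    p_value = math.erfc(abs(z_combined) / math.sqrt(2.0))
    return z_combined, p_value


# Toy inputs for illustration only (not from the study).
z, p = weighted_z([(2.0, 100), (2.0, 100)])
significant = abs(z) > 2 and p < 0.05
```

The sign of the combined z-score indicates the direction of change, so increased and decreased genes can be separated with the same significance threshold.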